# mounting my google drive
from google.colab import drive
drive.mount('/content/drive')
# changing the directory to my folder designated for this project
%cd /content/drive/My Drive/TU/SEMESTERS/f2022/data_final_proj
# importing important packages I will need
import matplotlib.pyplot as plt
import pandas as pd
!pip install geopandas
import geopandas as gpd
import numpy as np
import requests
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
Determine whether the driver or the constructor plays a larger role in Formula One success
- I will utilize a regression discontinuity framework to identify the effect that switching teams had on a driver's average lap time, pitstop time, fastest speed, and average starting and finishing positions for a given year. If the driver switches to a 'better' team, average lap time and pitstop time are expected to decrease and average fastest speed is expected to increase.
Build a regression model that will predict how a driver's average finishing position will be affected if they switch teams between seasons.
- I will utilize historical data on the current 10 Formula One teams to build a regression model that will predict how a driver's average finishing position will be affected if they switch teams. I will consider the following variables when building my model: lap times, pitstop times, average starting position, average finishing position, fastest lap speed, and constructor reference information.
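The modeling plan above can be sketched with the scikit-learn tools imported at the top of this notebook (StandardScaler, KNeighborsRegressor, Pipeline, cross_val_score). This is only a minimal illustration on synthetic data, not the final model: the feature names (`avg_lap`, `avg_pit`, `avg_start`) mirror the variables described above, and the fitted relationship is made up for demonstration. The real model will be built later from the merged dataframes.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the driver-season features described above
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'avg_lap': rng.normal(95000, 5000, n),   # average lap time (ms), hypothetical scale
    'avg_pit': rng.normal(24000, 2000, n),   # average pit stop (ms), hypothetical scale
    'avg_start': rng.uniform(1, 20, n),      # average starting position on the grid
})
# Assume finishing position loosely tracks starting position plus noise (made-up relationship)
y = 0.8 * X['avg_start'] + rng.normal(0, 2, n)

# Scale features first so KNN distances are not dominated by the millisecond-scale columns
model = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=5)),
])
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores.mean())
```

The pipeline matters here because KNN is distance-based: without standardization, the lap-time column (tens of thousands of milliseconds) would swamp the grid-position column entirely.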
My motivations: F1 is one of the few sports that is almost completely reliant on data. Telemetry from the cars to the pits has been part of races since the 1980s; however, F1 data usage has greatly expanded since then. Depending on the team, cars can be fitted with upwards of 300 sensors that generate over one million data points per second. Teams are limited in the number of employees that can be present on race day, so they collect data en masse in real time and transmit it to analysts and data engineers at an off-site location for immediate feedback. Drivers can rely not only on gut instinct and years of training, but also on models and predictions delivered in real time when making difficult decisions. This is why I consider F1 analytics important. It interests me because F1 has been rapidly expanding into the USA recently through social media promotions and its deal with Netflix, and this kind of analytics would constitute my dream job, which is ultimately why I decided to pursue this as my milestone topic.
Further Reading about Data in Formula One:
driver_standings.csv : This dataset illustrates the outcomes of each race. It shows how each driver did at specific races and how many points were awarded. The following variables are pertinent to our analysis: raceId, driverId, points, position, & wins (number of wins so far in the season).
drivers.csv : This dataset gives detailed information on all Formula One drivers for the past 70+ years. Within this dataset, there is information on nationality as well as age. The following variables were important in our analysis: driverId, forename, surname & nationality.
lap_times.csv : This dataset details each driver and how they performed each lap of every race. Each row signifies how a specific driver performed for each lap. We are given position after completing the lap as well as the time of lap. The following variables were used throughout my analysis: raceId, driverId, lap, milliseconds (time of lap in milliseconds), & position (driver position after lap).
pit_stops.csv : This dataset provides information about all pitstops within the races. The pitstop dataset provides detailed information on stop length, what lap the stop occurred, and which driver was stopping. The following variables were pertinent to our analysis: raceId, driverId, stop (for a specific driver at a specific race, which stop was this), lap number (lap when the stop occurred), & milliseconds (time of the pitstop in milliseconds).
# reading each of the datasets into a dataframe
circuits = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/circuits.csv")
constructor_results = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/constructor_results.csv")
constructor_standings= pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/constructor_standings.csv")
constructors = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/constructors.csv")
driver_standings = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/driver_standings.csv")
drivers = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/drivers.csv")
lap_times = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/lap_times.csv")
pit_stops = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/pit_stops.csv")
qualifying = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/qualifying.csv")
races = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/races.csv")
results = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/results.csv")
seasons = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/seasons.csv")
sprint_results = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/sprint_results.csv")
status = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/status.csv")
Data Set Citations
## Red Bull Racing: Drivers by Year
url = 'https://en.wikipedia.org/wiki/Red_Bull_Racing'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
red_bull_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    red_bull_drivers.append(df_t[0])
red_bull_drivers[3] = red_bull_drivers[3].drop(columns = ['Car', 'Engine', 'Tyres', 'No.', 'Points', 'Position', 'Name'])
red_bull_drivers[3] = red_bull_drivers[3].join(red_bull_drivers[3]['Drivers'].str.split(expand=True).rename(
    columns={0: 'First1', 1: 'Last1', 2: 'First2', 3: 'Last2', 4: 'First3', 5: 'Last3'}
))
red_bull_drivers[3]['Driver1'] = red_bull_drivers[3]['First1'] + ' ' + red_bull_drivers[3]['Last1']
red_bull_drivers[3]['Driver2'] = red_bull_drivers[3]['First2'] + ' ' + red_bull_drivers[3]['Last2']
red_bull_drivers[3]['Driver3'] = red_bull_drivers[3]['First3'] + ' ' + red_bull_drivers[3]['Last3']
red_bull_drivers[3] = red_bull_drivers[3].drop(columns = ['Drivers', 'First1', 'Last1', 'First2', 'Last2', 'First3', 'Last3'])
red_bull_drivers[3].drop([18], axis=0, inplace=True)
red_bull_drivers[3] = pd.melt(red_bull_drivers[3], id_vars=['Year'])
red_bull_drivers[3] = red_bull_drivers[3].dropna()
red_bull_drivers[3]['Constructor'] = 'Red Bull'
red_bull_drivers[3] = red_bull_drivers[3].rename(columns={"value": "Driver"})
red_bull_drivers[3] = red_bull_drivers[3].drop(columns = ['variable'])
red_bull_drivers[3]['constructorId'] = '9'
red_bull_drivers[3]
## Mercedes: Drivers by Year
url = 'https://en.wikipedia.org/wiki/Mercedes-Benz_in_Formula_One'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
mercedes_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    mercedes_drivers.append(df_t[0])
mercedes_drivers[5].drop([2], axis =0, inplace =True)
mercedes_drivers[5] = mercedes_drivers[5][['Year', 'Drivers']]
# Please note that Mercedes raced drivers in 1954 and 1955 before taking a hiatus until 2010.
# For the purposes of my analysis on the current Mercedes team, I did not include 1954-1955, as the rules were different and there were many more drivers
mercedes_drivers[5].drop([0], axis =0, inplace =True)
mercedes_drivers[5].drop([1], axis =0, inplace =True)
mercedes_drivers[5].drop([16], axis =0, inplace =True)
mercedes_drivers[5] = mercedes_drivers[5].join(mercedes_drivers[5]['Drivers'].str.split(expand=True).rename(
    columns={0: 'First1', 1: 'Last1', 2: 'First2', 3: 'Last2', 4: 'First3', 5: 'Last3'}))
mercedes_drivers[5]['Driver1'] = mercedes_drivers[5]['First1'] + ' ' + mercedes_drivers[5]['Last1']
mercedes_drivers[5]['Driver2'] = mercedes_drivers[5]['First2'] + ' ' + mercedes_drivers[5]['Last2']
mercedes_drivers[5]['Driver3'] = mercedes_drivers[5]['First3'] + ' ' + mercedes_drivers[5]['Last3']
mercedes_drivers[5] = mercedes_drivers[5][['Year', 'Driver1', 'Driver2', 'Driver3']]
mercedes_drivers[5] = pd.melt(mercedes_drivers[5], id_vars=['Year'])
mercedes_drivers[5] = mercedes_drivers[5].dropna()
mercedes_drivers[5]['Constructor'] = 'Mercedes'
mercedes_drivers[5] = mercedes_drivers[5].rename(columns={"value": "Driver"})
mercedes_drivers[5] = mercedes_drivers[5].drop(columns= ['variable'])
mercedes_drivers[5]['constructorId'] = '131'
mercedes_drivers[5]
# Ferrari
url = 'https://en.wikipedia.org/wiki/Ferrari_Grand_Prix_results'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
ferrari_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    ferrari_drivers.append(df_t[0])
ferrari_drivers= pd.concat([ferrari_drivers[6], ferrari_drivers[7]], ignore_index=True)
ferrari_drivers = ferrari_drivers[['Year', 'Driver']]
ferrari_drivers = ferrari_drivers.dropna()
ferrari_drivers.drop([30], axis =0, inplace =True)
ferrari_drivers.drop([40], axis =0, inplace =True)
# Adding constructor name
ferrari_drivers['Constructor'] = 'Ferrari'
ferrari_drivers['constructorId'] = '6'
ferrari_drivers
# McLaren
url = 'https://en.wikipedia.org/wiki/McLaren_Grand_Prix_results'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
mclaren_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    mclaren_drivers.append(df_t[0])
mclaren_drivers = pd.concat([mclaren_drivers[5], mclaren_drivers[6]], ignore_index=True)
mclaren_drivers = mclaren_drivers[['Year', 'Drivers']]
mclaren_drivers = mclaren_drivers.dropna()
mclaren_drivers.drop([33], axis =0, inplace =True)
mclaren_drivers.drop([43], axis =0, inplace =True)
# Adding constructor name
mclaren_drivers['Constructor'] = 'McLaren'
mclaren_drivers = mclaren_drivers.rename(columns={"Drivers": "Driver"})
mclaren_drivers['constructorId'] = '1'
mclaren_drivers
# BWT Alpine
url = 'https://en.wikipedia.org/wiki/Alpine_F1_Team'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
alpine_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    alpine_drivers.append(df_t[0])
alpine_drivers[1] = alpine_drivers[1][['Year', 'Drivers']]
alpine_drivers[1] = alpine_drivers[1].dropna()
alpine_drivers[1].drop([6], axis =0, inplace =True)
alpine_drivers[1]['Constructor'] = 'Alpine F1 Team'
alpine_drivers[1]['constructorId'] = '214'
alpine_drivers[1] = alpine_drivers[1].rename(columns={"Drivers": "Driver"})
alpine_drivers[1]
# Alpha Tauri
url = 'https://en.wikipedia.org/wiki/Scuderia_AlphaTauri'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
alpha_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    alpha_drivers.append(df_t[0])
alpha_drivers[1] = alpha_drivers[1][['Year', 'Drivers']]
alpha_drivers[1] = alpha_drivers[1].dropna()
alpha_drivers[1].drop([9], axis =0, inplace =True)
alpha_drivers[1]['Constructor'] = 'AlphaTauri'
alpha_drivers[1]['constructorId'] = '213'
alpha_drivers[1] = alpha_drivers[1].rename(columns={"Drivers": "Driver"})
alpha_drivers[1]
# Aston Martin
# Please note that Aston Martin also raced in 1959 - 1960, but I will not be including them in this analysis
url = 'https://en.wikipedia.org/wiki/Aston_Martin_in_Formula_One'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
aston_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    aston_drivers.append(df_t[0])
aston_drivers[3] = aston_drivers[3][['Year', 'Driver']]
aston_drivers[3] = aston_drivers[3].dropna()
aston_drivers[3].drop([7], axis =0, inplace =True)
aston_drivers[3]['Constructor'] = 'Aston Martin'
aston_drivers[3]['constructorId'] = '117'
aston_drivers[3]
# Alfa Romeo
# Please note Alfa Romeo has raced in 1950-1951, 1979-1985, and 2019-2022 but for this project I will only include the third set of dates
url = 'https://en.wikipedia.org/wiki/Alfa_Romeo_in_Formula_One'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
alfa_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    alfa_drivers.append(df_t[0])
alfa_drivers[2] = alfa_drivers[2][['Year', 'Drivers']]
alfa_drivers[2] = alfa_drivers[2].dropna()
alfa_drivers[2].drop([2], axis =0, inplace =True)
alfa_drivers[2].drop([10], axis =0, inplace =True)
alfa_drivers[2].drop([15], axis =0, inplace =True)
alfa_drivers[2].drop(alfa_drivers[2].loc[0:9].index, axis =0, inplace =True)
alfa_drivers[2] = alfa_drivers[2].join(alfa_drivers[2]['Drivers'].str.split(expand=True).rename(
    columns={0: 'First1', 1: 'Last1', 2: 'First2', 3: 'Last2', 4: 'First3', 5: 'Last3'}))
alfa_drivers[2]['Driver1'] = alfa_drivers[2]['First1'] + ' ' + alfa_drivers[2]['Last1']
alfa_drivers[2]['Driver2'] = alfa_drivers[2]['First2'] + ' ' + alfa_drivers[2]['Last2']
alfa_drivers[2]['Driver3'] = alfa_drivers[2]['First3'] + ' ' + alfa_drivers[2]['Last3']
alfa_drivers[2] = alfa_drivers[2][['Year', 'Driver1', 'Driver2', 'Driver3']]
alfa_drivers[2] = pd.melt(alfa_drivers[2], id_vars=['Year'])
alfa_drivers[2] = alfa_drivers[2].dropna()
alfa_drivers[2]['Constructor'] = 'Alfa Romeo'
alfa_drivers[2]= alfa_drivers[2].drop(columns= ['variable'])
alfa_drivers[2] = alfa_drivers[2].rename(columns={"value": "Driver"})
alfa_drivers[2]['constructorId'] = '51'
alfa_drivers[2]
# Haas
url = 'https://en.wikipedia.org/wiki/Haas_F1_Team'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
haas_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    haas_drivers.append(df_t[0])
haas_drivers[1] = haas_drivers[1][['Year', 'Drivers']]
haas_drivers[1] = haas_drivers[1].dropna()
haas_drivers[1].drop([22], axis =0, inplace =True)
haas_drivers[1]['Constructor'] = 'Haas F1 Team'
haas_drivers[1]['constructorId'] = '210'
haas_drivers[1] = haas_drivers[1].rename(columns={"Drivers": "Driver"})
haas_drivers[1]
# Williams Racing
url = 'https://en.wikipedia.org/wiki/Williams_Grand_Prix_results'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
williams_drivers = []
for t in soup.find_all("table"):
    df_t = pd.read_html(str(t))
    williams_drivers.append(df_t[0])
williams_drivers = pd.concat([williams_drivers[4], williams_drivers[5]], ignore_index=True)
williams_drivers = williams_drivers[['Year', 'Drivers']]
williams_drivers = williams_drivers.dropna()
williams_drivers.drop([31], axis =0, inplace =True)
williams_drivers.drop([43], axis =0, inplace =True)
# Adding constructor name
williams_drivers['Constructor'] = 'Williams'
williams_drivers = williams_drivers.rename(columns={"Drivers": "Driver"})
williams_drivers['constructorId'] = '3'
williams_drivers
# COMBINING EACH OF THE DATAFRAMES FROM THE WEB SCRAPING INTO A LARGE DATAFRAME
current_drivers = pd.concat([haas_drivers[1], alfa_drivers[2],aston_drivers[3], alpha_drivers[1], alpine_drivers[1]
, mercedes_drivers[5], red_bull_drivers[3], mclaren_drivers, williams_drivers, ferrari_drivers], ignore_index=True)
current_drivers
# THIS DATAFRAME SHOWS THE DRIVERS, THEIR TEAMS, AND THE YEARS FOR THE TEN CURRENT F1 TEAMS
Cleaning the lap_times dataframe & displaying dtypes for incorporation into the regression model later in this project
# RENAMING COLUMNS TO BE UNDERSTANDABLE
lap_times = lap_times.rename(columns={"time": "lap_time", "milliseconds": "lap_in_milli", "position" : "position_after_lap"})
# WE WILL ONLY BE USING MILLISECONDS, DROPPING LAP TIME IN MINUTES AND SECONDS
lap_times = lap_times.drop(columns=['lap_time'])
lap_times
# ALL OF THE VARIABLES IN LAP_TIMES SHOULD BE QUANTITATIVE. IDS, POSITIONS, AND TIMES SHOULD BE INTEGERS THAT WE CAN DO CALCULATIONS WITH
lap_times.dtypes
# RENAMING COLUMNS TO BE UNDERSTANDABLE
pit_stops = pit_stops.rename(columns={"time" : "time_when_stopped", "milliseconds": "stop_in_milli", "stop" : "stop_#"})
# WE WILL ONLY BE USING MILLISECONDS, DROPPING LAP TIME IN MINUTES AND SECONDS
pit_stops = pit_stops.drop(columns=['duration'])
pit_stops
# ALL OF THE VARIABLES IN PIT_STOPS SHOULD BE QUANTITATIVE. IDS, STOP NUMBERS, AND TIMES SHOULD BE INTEGERS THAT WE CAN DO CALCULATIONS WITH
pit_stops.dtypes
How do qualifying rounds work in Formula One?
There are three qualifying rounds for each race weekend in Formula One, typically held in the days before the race on Sunday. In the first round, all twenty Formula One drivers participate. The drivers race one lap individually and attempt to set the fastest time. Those in the top fifteen move on to the second round of qualifiers, while those with the bottom five times start race day in the last five positions on the grid. In the second round of qualifiers, drivers again drive one lap aiming for the fastest time. Those with the fastest ten times move on to the third round of qualifiers, and those with the bottom five times in Q2 are placed in spots 11-15 on the grid for Sunday's race. The final qualifying round determines the order of the drivers who will be in positions 1-10 on race day.
How should this information shape our data?
Since not all drivers will participate in all three rounds of qualifying, I am going to create dummy variables that will indicate (based on their success in qualifiers) whether a driver will race in the top ten, middle five, or bottom five.
# CLEANING THE QUALIFYING DATASET, DROPPING UNNEEDED VARIABLES
qualifying = qualifying.drop(columns = ["number"])
# REPLACING NAN WITH 0, THIS MEANS THAT THEY DID NOT RACE IN THAT ROUND, NOTE EVERYONE SHOULD HAVE A TIME FOR Q1 BUT NOT EVERYONE WILL MAKE IT TO Q3
qualifying = qualifying.replace('\\N','0')
# DUMMY TO INDICATE THAT THEY WERE PLACED IN TOP TEN STARTING POSITIONS (1-10)
qualifying['top_ten'] = np.where(qualifying['q3'] != '0' , 1, 0)
# DUMMY TO INDICATE THAT THEY WERE IN THE MIDDLE FIVE STARTING POSITIONS (11-15)
qualifying['middle'] = np.where((qualifying['q3'] == '0') & (qualifying['q2'] != '0'), 1, 0)
# DUMMY TO INDICATE THAT THEY WERE IN THE BOTTOM FIVE STARTING POSITIONS (16-20)
qualifying['bottom_five'] = np.where((qualifying['q3'] == '0') & (qualifying['q2'] == '0'), 1, 0)
# RENAMING THE COLUMN SO WE KNOW WHERE THEY STARTED
qualifying = qualifying.rename(columns= {"position" : "starting_position"})
# MERGING IN INFORMATION ABOUT WHICH RACE
qualifying = qualifying.merge(races, how='left', on="raceId")
qualifying = qualifying[['qualifyId', 'raceId', 'driverId','constructorId', 'starting_position', 'top_ten', 'middle', 'bottom_five', 'circuitId', 'year', 'name']]
qualifying
qualifying.dtypes
# ALL OF THE VARIABLES IN THE QUALIFYING DATAFRAME SHOULD BE INTEGERS AS THEY EITHER REPRESENT IDS, POSITIONS, OR DUMMY VARIABLES.
# THE ONLY EXCEPTION TO THIS RULE IS THE NAME OF THE CIRCUIT
# DROPPING COLUMNS I DON'T NEED FROM THE ORIGINAL RESULTS DATAFRAME
results2 = results.drop(columns=['resultId', 'number', 'positionText','positionOrder', 'time'])
# RENAMING THE COLUMNS FOR UNDERSTANDABILITY
results2 = results2.rename(columns= {"grid" : "starting_position", "position":"finishing_position", "rank" : "overall_standing", "milliseconds":"time_milliseconds"})
# FILLING IN VALUES THAT HAD NO DATA (INDICATING THAT THE DRIVER DID NOT RACE OR DID NOT FINISH THE RACE) WITH ZEROS
results2 = results2.replace('\\N','0')
# CASTING COLUMNS TO NUMERIC DTYPES IF THEY REPRESENT QUANTITIES AND TO STRINGS IF THEY REPRESENT CATEGORIES
results2['time_milliseconds'] = results2['time_milliseconds'].astype(int)
results2['finishing_position'] = results2['finishing_position'].astype(int)
results2['overall_standing'] = results2['overall_standing'].astype(int)
results2['statusId'] = results2['statusId'].astype(str)
results2['fastestLap'] = results2['fastestLap'].astype(int)
results2['fastestLapSpeed'] = results2['fastestLapSpeed'].astype(float)
# ADDING A DUMMY TO INDICATE THAT THE DRIVER FINISHED THE RACE
results2['finished'] = np.where(results2['statusId']=='1', 1,0)
# ADDING DUMMIES THAT INDICATE WHERE THE DRIVER PLACED
results2['podium'] = np.where(results2['finishing_position']<=3, 1,0)
results2['top_ten'] = np.where((results2['finishing_position']>3) & (results2['finishing_position']<=10), 1,0)
results2['bottom_ten'] = np.where(results2['finishing_position']>10, 1,0)
results2 = results2.drop(columns=['fastestLapTime'])
results2
results2.dtypes
# ALL OF THE VARIABLES IN THE REFINED RESULTS DATAFRAME SHOULD BE NUMERICAL AS THEY REPRESENT TIMES, IDS, AND DUMMY VARIABLES
# WE ARE KEEPING STATUSID AS AN OBJECT BECAUSE THE NUMBERS REFER TO SPECIFIC CATEGORIES REGARDING HOW A DRIVER FINISHED OR WHAT MAY HAVE CAUSED THEM TO NOT COMPLETE A RACE
- Concat web scrapes & limit drivers: The first step towards my regression model is limiting the data to only include information on drivers who have raced for the 10 current Formula One teams. As mentioned in the data section, I scraped the Wikipedia pages of each team for this information. Now, I am going to create a list of unique drivers that will act as a foreign key to any dataframe I use going forward.
# LET'S CREATE A LIST OF THE DRIVERS WHO HAVE DRIVEN FOR THE 10 CURRENT TEAMS
names_of_drivers = current_drivers['Driver'].unique()
names_of_drivers = pd.DataFrame(data = names_of_drivers, columns = ['Driver'])
# THERE WERE A HANDFUL OF SPELLING MISTAKES IN DRIVER NAME BETWEEN THE SCRAPES AND KAGGLE
# I HAVE MANUALLY ADJUSTED THESE SO THAT MERGING WOULD BE DONE EASILY
names_of_drivers = names_of_drivers.replace("Nikita Mazepin[c]", "Nikita Mazepin")
names_of_drivers = names_of_drivers.replace("Carlos Sainz Jr.", "Carlos Sainz")
names_of_drivers = names_of_drivers.replace("Zhou Guanyu", "Guanyu Zhou")
names_of_drivers = names_of_drivers.replace("Alex Albon", "Alexander Albon")
# NOW LETS MERGE IN THE DRIVER IDS
# FIRST I HAVE TO CREATE A DATAFRAME THAT CONTAINS THE DRIVER IDS AND FULL NAMES USING THE KAGGLE DATASETS
# THEN I WILL USE THIS DATAFRAME AND MERGE IT WITH OUR UNIQUE NAMES TO CREATE A FOREIGN KEY
drivers_names = pd.DataFrame(data=drivers[['driverId', 'forename', 'surname']])
drivers_names['Driver'] = drivers_names['forename'] + ' ' + drivers_names['surname']
drivers_names = drivers_names[['driverId', 'Driver']]
names_of_drivers = names_of_drivers.merge(drivers_names, how='left')
names_of_drivers =names_of_drivers.dropna()
names_of_drivers.head()
# THERE ARE 46 UNIQUE DRIVERS WHO HAVE DRIVEN OR ARE CURRENTLY DRIVING FOR THE 10 CURRENT F1 TEAMS
2. Ranking Circuits Now that we have created a way to limit our data to only the current drivers, let's build a ranking system for the circuits. I will build the ranking system using latitude, longitude, altitude, and average lap time. The following assumption was made when I built my ranking model:
- I used the median rather than the mean for altitude because, unlike latitude and longitude, altitude is not bounded to a limited range, so outliers skewed the mean.
# LETS CREATE A RANKING SYSTEM FOR EACH OF THE CIRCUITS USED SINCE 2010
# LIMITING THE RACES FROM 2010 ONWARD AS THIS IS WHAT OUR DRIVERS WILL BE LIMITED TO
races_lim = races.loc[races['year'] >=2010]
# LIMITING THE CIRCUITS DATAFRAME BASED ON MATCHES IN THIS LIMITED SET
circuits_lim = circuits.merge(races_lim, how='inner', on = ['circuitId'])
# LIMITING THE CIRCUITS DATA FRAME TO ONLY INCLUDE THE INFORMATION WE NEED
circuits_lim= circuits_lim[['circuitId', 'circuitRef', 'lat', 'lng', 'alt']]
# TAKING THE ABS VALUE OF THE LAT AND LNG SO I CAN PULL THEM INTO THE RANKING SYSTEM
# ABS VALUES ALSO ENSURE THAT BEING ABOVE OR BELOW THE EQUATOR / EAST OR WEST OF THE PRIME MERIDIAN DOES NOT CANCEL OUT ANY AVERAGES
circuits_lim['lat'] = circuits_lim['lat'].abs()
circuits_lim['lng'] = circuits_lim['lng'].abs()
# DELETING DUPLICATES
circuits_lim = circuits_lim.drop_duplicates(subset=['circuitId'])
# REPLACING THE THREE \N ALTITUDE VALUES WITH ZERO SO THE COLUMN CAN BE CAST TO INT
circuits_lim = circuits_lim.replace('\\N','0')
circuits_lim['alt'] = circuits_lim['alt'].astype(int)
# CREATING VARIABLES THAT WILL SHOW THE DISTANCE FROM THE AVERAGE FOR EACH GEOGRAPHICAL METRIC
circuits_lim['long_diff'] = circuits_lim['lng'].apply(lambda lng: lng - circuits_lim['lng'].mean()).abs()
circuits_lim['lat_diff'] = circuits_lim['lat'].apply(lambda lat: lat - circuits_lim['lat'].mean()).abs()
circuits_lim['alt_diff'] = circuits_lim['alt'].apply(lambda alt: alt - circuits_lim['alt'].median()).abs()
circuits_lim
# NOW LETS LIMIT THE RESULTS DATAFRAME TO ONLY INCLUDE INFORMATION ON THE CURRENT DRIVERS AND CIRCUITS FROM 2010 ONWARD
# ONCE WE HAVE THIS INFORMATION, WE CAN THEN CREATE A VARIABLE THAT AVERAGES THE LAP TIME ACROSS ALL DRIVERS AND WE CAN THEN ADD IT TO OUR RANKING MODEL
# Downselecting the results dataframe for only relevant variables
results_recent = results2[['driverId', 'raceId', 'starting_position', 'constructorId', 'finishing_position','fastestLapSpeed']]
# merge in circuit name based on raceID
results_recent = results_recent.merge(races, how='right')
results_recent = results_recent[['driverId', 'raceId', 'circuitId', 'name', 'year', 'constructorId','starting_position', 'finishing_position','fastestLapSpeed' ]]
results_recent = results_recent.dropna()
# I only want information for the list of recent drivers identified above
results_recent = results_recent.merge(names_of_drivers, how='right')
# I now want information on their lap times in milliseconds
results_recent = results_recent.merge(lap_times, how='left')
# Adding constructor Name
results_recent= results_recent.merge(constructors[['constructorId', 'constructorRef']], how = 'left', on = 'constructorId')
# limiting to 2010 onward
results_recent = results_recent.loc[results_recent['year'] >=2010]
# ADDING PITSTOP INFORMATION
results_recent = results_recent.merge(pit_stops, how='left', on = ['driverId', 'lap', 'raceId'])
results_recent = results_recent.drop(columns=['position_after_lap', 'stop_#', 'time_when_stopped'])
# TO BUILD OUR RANKING MODEL, LETS ALSO FIND THE AVERAGE LAP TIME BY CIRCUIT FOR ALL CIRCUITS USED SINCE 2010
avg_lap_overall = results_recent.groupby(['circuitId'], as_index=False)[['lap_in_milli']].mean()
avg_lap_overall = avg_lap_overall.rename(columns= {"lap_in_milli" : "avg_lap_circuit"})
results_recent= results_recent.merge(avg_lap_overall[[ 'circuitId', 'avg_lap_circuit']], how = 'left', on = ['circuitId'] )
results_recent
# LETS MERGE THE AVERAGE LAP TIME BACK INTO THE CIRCUITS LIMITED DATAFRAME THAT WE CREATED
circuits_lim = circuits_lim.merge(results_recent[['circuitId', 'avg_lap_circuit']], how = 'inner', on = ['circuitId'] )
circuits_lim = circuits_lim.drop_duplicates(subset=['circuitId'])
# NOW THAT WE HAVE ALL THE PERTINENT INFORMATION, LETS BUILD A RANKING SYSTEM
# BECAUSE WE HAVE DATA WITH DIFFERENT RANGES, I DECIDED TO RANK EACH COLUMN INDIVIDUALLY AND THEN RANK THE SUM
circuits_lim["lat_rank"] = circuits_lim[["lat_diff"]].rank(method='average',ascending=True).astype(int)
circuits_lim["lng_rank"] = circuits_lim[["long_diff"]].rank(method='average',ascending=True).astype(int)
circuits_lim["alt_rank"] = circuits_lim[["alt_diff"]].rank(method='average',ascending=True).astype(int)
circuits_lim["lap_rank"] = circuits_lim[["avg_lap_circuit"]].rank(method='average',ascending=True).astype(int)
column_names = ['lat_rank', 'lng_rank', 'alt_rank', 'lap_rank']
circuits_lim['rank_sums']= circuits_lim[column_names].sum(axis=1)
# CREATING THE OVERALL RANK
circuits_lim["rank"] = circuits_lim['rank_sums'].rank(ascending = False)
# NOW I WILL CREATE A RANK FROM 1.001 - 1.034
# THIS WAY I CAN MULTIPLY DRIVER STATS GOING FORWARD IN ACCORDANCE TO HOW HARD EACH LAP WAS WITHOUT DRASTICALLY AFFECTING THE STATS
circuits_lim['rank_mult'] = circuits_lim['rank'].apply(lambda rank: (rank/1000)+1)
circuit_rank = circuits_lim[['circuitId', 'circuitRef', 'rank', 'rank_mult']]
circuit_rank
3. Yearly Averages for Drivers I will now modify the results dataframe heavily so that it only contains year-to-year averages. This was done by merging multiple Kaggle datasets, performing aggregations on specific columns, and deleting unnecessary variables. After duplicates were removed, we were left with year-to-year averages for each of our recent drivers since 2010.
results_recent= results_recent[['Driver', 'driverId', 'year', 'constructorRef', 'constructorId','raceId', 'circuitId', 'name', 'lap', 'lap_in_milli', 'stop_in_milli', 'fastestLapSpeed', 'starting_position', 'finishing_position']]
# MERGING IN THE RANK MULTIPLIER FOR EACH CIRCUIT
results_recent = results_recent.merge(circuit_rank[['rank_mult', 'circuitId']], how = "inner", on = ['circuitId'])
# REPLACING THE LAP IN MILLI, STOP IN MILLI AND FASTEST LAP SPEED WITH AN INTERACTION BETWEEN THE ORIGINAL VARIABLE AND THE CIRCUIT MULTIPLIER
results_recent['lap_in_milli'] = results_recent['lap_in_milli'] * results_recent['rank_mult']
results_recent['stop_in_milli'] = results_recent['stop_in_milli'] * results_recent['rank_mult']
results_recent['fastestLapSpeed'] = results_recent['fastestLapSpeed'] * results_recent['rank_mult']
# DOING THE SAME THING TO THE STARTING AND FINISHING POSITION
results_recent['starting_position'] = results_recent['starting_position'] * results_recent['rank_mult']
results_recent['finishing_position'] = results_recent['finishing_position'] * results_recent['rank_mult']
results_recent
# NOW THAT EACH OF THE STATS IS ADEQUATELY MULTIPLIED BY CIRCUIT DIFFICULTY, WE CAN CREATE YEARLY AVERAGES
# ADDING A COLUMN THAT REPRESENTS THE AVERAGE LAP TIME FOR A GIVEN YEAR
avg_lap = results_recent.groupby(['year', 'Driver'], as_index=False)[['lap_in_milli']].mean()
avg_lap = avg_lap.rename(columns= {"lap_in_milli" : "avg_lap"})
# MERGING BACK INTO RESULTS RECENT
results_recent= results_recent.merge(avg_lap[['year', 'Driver', 'avg_lap']], how = 'left', on = ['year', 'Driver'] )
# ADDING A COLUMN THAT REPRESENTS THE AVERAGE PIT STOP FOR A GIVEN YEAR
avg_pit = results_recent.groupby(['year', 'Driver'], as_index=False)[['stop_in_milli']].mean()
avg_pit = avg_pit.rename(columns= {"stop_in_milli" : "avg_pit"})
# MERGING BACK INTO RESULTS
results_recent= results_recent.merge(avg_pit[['year', 'Driver', 'avg_pit']], how = 'left', on = ['year', 'Driver'] )
# ADDING IN INFORMATION ON AVG FASTEST LAP SPEED
avg_fastest_speed = results_recent.groupby(['year', 'Driver'], as_index=False)[['fastestLapSpeed']].mean()
avg_fastest_speed = avg_fastest_speed.rename(columns= {"fastestLapSpeed" : "avg_fastest_speed"})
results_recent= results_recent.merge(avg_fastest_speed[['year', 'Driver', 'avg_fastest_speed']], how = 'left', on = ['year', 'Driver'] )
# DROPPING THE INDIV LAP SPEEDS AND PITS
results_recent = results_recent.drop(columns=['lap_in_milli', 'stop_in_milli', 'fastestLapSpeed'])
# ADDING IN INFORMATION ON AVERAGE STARTING POSITION
avg_start = results_recent.groupby(['year', 'Driver'], as_index=False)[['starting_position']].mean()
avg_start = avg_start.rename(columns= {"starting_position" : "avg_start"})
results_recent= results_recent.merge(avg_start[['year', 'Driver', 'avg_start']], how = 'left', on = ['year', 'Driver'] )
# ADDING IN INFORMATION ON AVERAGE FINISHING POSITION
avg_finish = results_recent.groupby(['year', 'Driver'], as_index=False)[['finishing_position']].mean()
avg_finish = avg_finish.rename(columns= {"finishing_position" : "avg_finish"})
results_recent= results_recent.merge(avg_finish[['year', 'Driver', 'avg_finish']], how = 'left', on = ['year', 'Driver'] )
# DROPPING INFORMATION ABOUT STARTING AND FINISHING POSITION FOR EACH RACE
results_recent = results_recent.drop(columns=['starting_position', 'finishing_position'])
# WORKING TOWARDS THE NUMBER OF WINS PER SEASON
driver_standings=driver_standings.merge(results_recent[['driverId', 'Driver', 'raceId']], on = ['raceId', 'driverId'] )
results_recent= results_recent.merge(driver_standings[['raceId', 'driverId', 'wins']], how = 'left', on = ['raceId', 'driverId'] )
max_wins =results_recent.groupby(['year', 'Driver'], as_index=False)[['wins']].max()
max_wins = max_wins.rename(columns= {"wins" : "max_wins"})
results_recent= results_recent.merge(max_wins[['year', 'Driver', 'max_wins']], how = 'left', on = ['year', 'Driver'] )
results_recent = results_recent.drop(columns=['wins'])
results_recent = results_recent.drop(columns=['raceId', 'circuitId', 'name', 'lap'])
# I AM DELETING DUPLICATES BECAUSE I WANT YEAR AVERAGES NOT RACE BY RACE DATA
results_recent= results_recent.drop_duplicates(keep='first')
results_recent
# ADDING INFORMATION ON CONSTRUCTOR STANDINGS DURING A GIVEN YEAR TO EACH DRIVER
# WEIGHTING THE CONSTRUCTORS POINTS BY CIRCUIT DIFFICULTY
race_merge = circuit_rank.merge(races[['raceId', 'circuitId', 'year']], on = ['circuitId'])
constructor_rank = race_merge.merge(constructor_standings[['raceId', 'constructorId', 'points']], on = ['raceId'])
constructor_rank['points'] = constructor_rank['points'] * constructor_rank['rank_mult']
points_szn = constructor_rank.groupby(['constructorId','year'], as_index=False)[['points']].max()
points_szn = points_szn.rename(columns= {"points" : "points_szn"})
constructor_rank= constructor_rank.merge(points_szn[['year', 'constructorId', 'points_szn']], on = ['year', 'constructorId'])
# MERGING THE POINTS THAT A CONSTRUCTOR HAD IN ONE SEASON BACK INTO RESULTS RECENT
results_recent= results_recent.merge(constructor_rank[['year', 'constructorId', 'points_szn']], on = ['year', 'constructorId'])
results_recent= results_recent.drop(columns = ['rank_mult'])
results_recent= results_recent.drop_duplicates()
results_recent
For reference throughout the analysis, I have created a table that shows which team each of our recent drivers has raced for since 2010. A NaN in this table indicates that the driver did not race that year.
# I want to create a graph that will show which drivers have switched teams within this dataset
# Creating a pivot table containing information on year, driver, and the team they were racing for
results_pivot = results_recent.pivot_table(
index="Driver", columns=['year'],
values="constructorRef", aggfunc="max"
)
results_pivot
- Based on the table above, I picked four drivers and have decided to do a regression discontinuity analysis on five important success metrics: average lap time, average pitstop time, average fastest lap speed, average starting position, and average finishing position. These averages are in relation to the year as a whole. I chose the four drivers based on their notoriety, the amount of data they had, and whether they switched teams at least once in their career. In the end, I chose Max Verstappen, Lewis Hamilton, Sebastian Vettel, and Daniel Ricciardo
First I will create a graph that shows how these four drivers switched teams over the years. This graph is purely for visualization.
results_four = results_recent.loc[(results_recent['Driver']=="Daniel Ricciardo") | (results_recent['Driver']== "Lewis Hamilton" )| (results_recent['Driver']== "Max Verstappen") | (results_recent['Driver']== "Sebastian Vettel")]
# Now we are going to create a line plot that shows how these drivers moved over the years
# To keep the graph readable, I am limiting my regression discontinuity to four drivers who had more data and at least one team switch
plt.figure(figsize=(40,20))
sns.lineplot(data=results_four, x="year", y="constructorRef", hue="Driver", style = "Driver", markers = True, estimator=None,linewidth=10).set(title='The Four Drivers and their team history since 2010')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size': 25})
- Demonstrate Regression Discontinuity for Constructors: I will now analyze these four drivers and see how their averages were affected before and after their switches
In 2015 Max Verstappen was racing for Toro Rosso, who was ranked 7th overall. In 2016, he switched teams and raced for Red Bull, who was ranked 4th in 2015. We want to see if Max's average lap time, average pitstop time, average fastest lap speed, and other season indicators were affected by switching to a 'better' team.
max_verstappen = results_recent.loc[(results_recent['Driver'] == 'Max Verstappen')]
max_verstappen
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Max Verstappen Yearly Averages (centered at 2016)")
ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_pit", ax=ax)
ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)
ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_start", ax=ax)
ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_finish", ax=ax);
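The visual inspection above can be backed by a simple before/after comparison of each metric around the switch year. Below is a minimal sketch; the helper name `pre_post_means` and the toy data are my own for illustration, but the same function could be applied to `max_verstappen` with `switch_year=2016`:

```python
import pandas as pd

def pre_post_means(df, metric, switch_year):
    """Mean of `metric` in the seasons before vs. from the switch year onward."""
    pre = df.loc[df["year"] < switch_year, metric].mean()
    post = df.loc[df["year"] >= switch_year, metric].mean()
    return pre, post

# Toy seasons around a hypothetical 2016 switch
toy = pd.DataFrame({
    "year":       [2014, 2015, 2016, 2017],
    "avg_finish": [10.0,  9.0,  5.0,  4.0],
})

pre, post = pre_post_means(toy, "avg_finish", 2016)
print(pre, post)  # 9.5 4.5
```

A drop in the post-switch mean finishing position (lower is better) would be consistent with a discontinuity at the switch.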
In 2014 Sebastian Vettel was racing for Red Bull, who was ranked 2nd overall. In 2015, he switched teams and raced for Ferrari, who was ranked 4th in 2014. We want to see if Sebastian's average lap time, average pitstop time, and fastest lap speed were affected by switching to a 'lesser' team.
Additionally, in 2020 Sebastian was racing for Ferrari, who was ranked 6th overall. In 2021, he switched teams and raced for Aston Martin, who would be new to the grid in 2021. We want to see how this new team affected his average lap time, average pitstop time, and fastest lap speed.
seb_vettel = results_recent.loc[(results_recent['Driver'] == 'Sebastian Vettel')]
seb_vettel
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Sebastian Vettel's Yearly Averages (centered at 2015 & 2021)")
ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_pit", ax=ax)
ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)
ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_start", ax=ax)
ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_finish", ax=ax);
When Sebastian switched from Red Bull (2nd) to Ferrari (4th), his average lap time and average pitstop time decreased and his fastest lap speed remained unaffected. When Sebastian switched from Ferrari (6th) to Aston Martin (unranked), his average lap time, average pitstop time, and average fastest lap speed decreased.
In 2012 Lewis Hamilton was racing for McLaren, who was ranked 3rd overall. In 2013, he switched teams and raced for Mercedes, who was ranked 5th in 2012. We want to see if Lewis's average lap time, average pitstop time, and fastest lap speed were affected by switching to a 'lesser' team.
hamilton = results_recent.loc[(results_recent['Driver'] == 'Lewis Hamilton')]
hamilton
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Lewis Hamilton Yearly Averages (centered at 2013)")
ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_pit", ax=ax)
ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)
ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_start", ax=ax)
ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_finish", ax=ax);
Daniel Ricciardo has spent his relatively short Formula One career split between five teams: HRT, Toro Rosso, Red Bull, Renault, and McLaren. For simplicity, I will only analyze the largest jump of his career: Red Bull to Renault. After driving for Red Bull for five years, Ricciardo switched in 2019 from a team ranked 3rd to Renault, who was ranked 5th. Although this may not seem like a large jump, the top three teams in Formula One consistently score the best and the remaining seven tend to fall by the wayside. It is uncommon to see a driver in their prime voluntarily switch to a lower team, and thus I want to see how Ricciardo's average lap time, average pitstop time, and fastest lap times were affected.
ricciardo = results_recent.loc[(results_recent['Driver'] == 'Daniel Ricciardo') & ((results_recent['year']>=2014) & (results_recent['year']<=2020))]
ricciardo
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Daniel Ricciardo's Yearly Averages (centered at 2019)")
ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_pit", ax=ax)
ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)
ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_start", ax=ax)
ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_finish", ax=ax);
For the most part, the drivers and their switches followed our expectations. When Max Verstappen switched to a 'better' team, his success indicators were immediately affected positively, and the positive effect grew larger each year. When Daniel Ricciardo switched to a 'worse' team, his success indicators were negatively affected in the ensuing season.
There were certain anomalies in Lewis Hamilton's switch from McLaren to Mercedes. Although Mercedes was a lower-ranking team in the previous season, Lewis's switch had positive effects on his success indicators.
Additionally, Sebastian Vettel's first switch from Red Bull to Ferrari did not follow expectations, but his second switch from Ferrari to Aston Martin did follow expectations.
What we can see from this is that constructors do play a role in success; however, Formula One is a highly volatile sport and teams can change drastically year to year, especially ones with a strong historical footprint like Mercedes or Ferrari.
Going forward, I will still incorporate constructor success into my model but I will not rely on it solely.
results_recent.sort_values(by=['Driver', 'year'], inplace=True)
# LAGGING WITHIN EACH DRIVER SO THAT A DRIVER'S FIRST SEASON DOES NOT INHERIT ANOTHER DRIVER'S STATS
results_recent['prev_szn(avg_lap)'] = results_recent.groupby('Driver')['avg_lap'].shift(1)
results_recent['prev_szn(avg_pit)'] = results_recent.groupby('Driver')['avg_pit'].shift(1)
results_recent['prev_szn(avg_fastest_speed)'] = results_recent.groupby('Driver')['avg_fastest_speed'].shift(1)
results_recent['prev_szn(avg_start)'] = results_recent.groupby('Driver')['avg_start'].shift(1)
results_recent['prev_szn(avg_finish)'] = results_recent.groupby('Driver')['avg_finish'].shift(1)
results_recent['prev_szn(constructor)'] = results_recent.groupby('Driver')['constructorRef'].shift(1)
results_recent['prev_szn(points_szn)'] = results_recent.groupby('Driver')['points_szn'].shift(1)
# WE NEED TO GET DUMMIES FOR EACH OF THE CONSTRUCTORS
results_recent.dropna(inplace=True)
results_recent
# PICKING THE FEATURES I WANT TO USE IN MY MODEL
features = ["constructorRef", "prev_szn(avg_lap)",
"prev_szn(avg_pit)",
"prev_szn(avg_fastest_speed)", "prev_szn(avg_start)",
"prev_szn(avg_finish)", "prev_szn(constructor)", "driverId", "points_szn"]
# CHANGING ANY CATEGORICAL VARIABLE INTO A DUMMY
X_dict = results_recent[features].to_dict(orient="records")
# SPECIFYING WHAT MY OUTCOME VARIABLE WILL BE
y = results_recent["avg_finish"]
# SINCE WE WILL BE USING CROSS VALIDATION, MUST SPECIFY THE PIPELINE
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=5)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
# ESTABLISH THE FORMULA WE WILL USE FOR FIVE FOLD CROSS VALIDATION
def get_cv_error(k):
model = KNeighborsRegressor(n_neighbors=k)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
mse = np.mean(-cross_val_score(
pipeline, X_dict, y,
cv=5, scoring="neg_mean_squared_error"
))
return mse
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)
test_errs.plot.line()
test_errs.sort_values()
Initially, I thought it would be wise to optimize the KNN using five-fold cross validation because I wanted to minimize the error in the model. However, I realized that with a small sample of drivers and constructors, the larger k got, the more similar all of the predicted finishing positions became. I've included this in the project, but wanted to first explain why I would not be using the cross-validated k going forward.
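The flattening effect described above is easy to reproduce: when k equals the number of training points, KNN predicts the global mean for every query, regardless of the input. A small synthetic sketch (the toy data is mine, not the F1 dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))   # 20 toy "drivers"
y = np.arange(1, 21, dtype=float)      # finishing positions 1..20

# With k equal to the sample size, every neighborhood is the whole
# training set, so every prediction collapses to the global mean
knn_all = KNeighborsRegressor(n_neighbors=20).fit(X, y)
preds = knn_all.predict(X)
print(preds[:3])  # every entry equals y.mean() == 10.5
```

With only about 20 drivers per season, even a moderately large k pushes predictions toward this degenerate behavior, which is why a small k was kept instead.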
For the training set, I used lagged success indicators for each of the drivers. I also limited the training set to not include 2022, as this would be what I tested my predictions on.
# ESTABLISHING X TRAIN, Y TRAIN, AND X TEST
features = ["year", "driverId", "prev_szn(avg_lap)",
"prev_szn(avg_pit)",
"prev_szn(avg_fastest_speed)", "prev_szn(avg_start)",
"prev_szn(avg_finish)", "prev_szn(constructor)", "points_szn"]
X_train = pd.get_dummies(results_recent[features])
# I AM INCLUDING 2022 BUT ONLY USING LAGS IN MY MODEL SO THAT THE RACERS IN 2022 CAN STILL PULL THEIR LAG FROM 2021
X_train = X_train.loc[X_train['year']<2022]
y_train = results_recent[["avg_finish", "year"]]
y_train = y_train.loc[y_train['year']<2022]
y_train = y_train.drop(columns=['year'])
X_test = pd.get_dummies(results_recent[features])
# USING .copy() SO THE PREDICTION COLUMN CAN BE ADDED WITHOUT A SettingWithCopyWarning
X_test = X_test.loc[X_test['year'] == 2022].copy()
from sklearn.neighbors import KNeighborsRegressor
KNN_model = KNeighborsRegressor(n_neighbors=2).fit(X_train,y_train)
X_test['predicted_avg_finish'] = KNN_model.predict(X_test)
predictions = X_test[['driverId', 'prev_szn(avg_finish)', 'predicted_avg_finish']]
predictions
# MERGING IN DRIVER NAME AND ACTUAL FINISH FOR 2022
results_2022 = results_recent.loc[results_recent['year']==2022]
predictions = predictions.merge(results_2022[['Driver', 'driverId', 'year', 'avg_finish']], how="inner", on = ['driverId'])
predictions
We know that George Russell switched from Williams to Mercedes in 2022. We will use his data from 2022 and see what our model predicts his average finishing position to be. Our model predicted that Russell would average a finishing position of 11.72, but he actually finished with an average position of 4.27. This could be because our model fails to accurately weight the effect that switching to a better team has on a driver's success. Another switch was Valtteri Bottas, who moved from Mercedes to Alfa Romeo. Our model predicted that he would have an average finishing position of 11.06, but he actually finished with an average of 9.05. This is a much closer gap and reflects the negative effect of switching to a worse team. The slightly high prediction could be caused by his strong finishing positions in the previous season.
All in all, I have come to the conclusion that it is very difficult to predict finishing position precisely. My model was limited by the amount of public data available, the fact that constructors remodel their cars year to year, and the fact that when we use lags, we cannot correctly gauge the effect of a switch because our model uses metrics from the previous season (and thus the previous team) to predict how a driver will perform with a new team. These limitations led to large gaps between predicted average finish and actual average finish. Additionally, our model was limited by the fact that there are only 20 drivers within a given season, and thus using a large k in KNN centers all of the predictions around 7-10 regardless of the driver and their personal ability.
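One way to summarize the gaps discussed above in a single number is mean absolute error between predicted and actual average finish. A sketch using the two cases from the text plus one hypothetical row (in the notebook this would be computed on the real `predictions` table):

```python
import pandas as pd

# Predicted vs. actual 2022 average finishes: Russell and Bottas from the
# analysis above, plus one hypothetical driver for illustration
eval_df = pd.DataFrame({
    "predicted_avg_finish": [11.72, 11.06, 6.0],
    "avg_finish":           [ 4.27,  9.05, 5.0],
})

# Mean absolute error across drivers
mae = (eval_df["predicted_avg_finish"] - eval_df["avg_finish"]).abs().mean()
print(round(mae, 2))  # 3.49
```

An error of several positions on a 20-car grid underlines how hard average finish is to pin down with lagged features alone.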
# Converting my CoLab to html so that I can upload this file to Github
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/finaltutorial.ipynb